Abstract

This article reviews key text-mining capabilities of the R language and proposes methodologies for improving the text mining of large volumes of content, especially email, along with novel analytics and visualizations that enhance understanding of the content of a corpus. It builds on existing R-language text-mining APIs to suggest more meaningful and readable analytical capabilities and visualizations. Assuming a general familiarity with the R language, the article starts by looking at baseline capabilities of text-mining APIs, and gives examples of enhanced data-cleaning, normalization, data-augmentation and visualization methodologies. The aim is to develop a more deliberate process for deriving more relevant information from unstructured data, and email in particular. This article is intended as a preliminary study of several promising concepts; further research is needed to prove out the quantitative efficacy of these methods in commercial applications.

Introduction

Large-scale, unstructured data management solutions often have sophisticated requirements for search, analytics, categorization, and data visualization. These applications include document management, ediscovery, knowledge management, and, most recently, data-privacy management solutions (for GDPR). Customers are initially presented with a promising demo, but find that in production use, the results can be lackluster at best. As a result, some high-potential analytic and data visualization features go largely unutilized in daily use. How can this gap between the demo and customer experience be closed?

Product demos use data that has been manicured to perfection or that naturally contains the necessary semantic structure. In benign examples, demo data is clean by design, as with template libraries or Wikipedia articles. Real-world data, however, is unpredictable, noisy, and lacks the semantic elegance of Wikipedia. Nowhere is this more evident than in email, which represents the majority of the data in many unstructured data management implementations.

Email presents some of the greatest challenges to text-mining applications:

  • Emails are hastily written, with spelling, grammar and punctuation errors
  • Email typically contains a large volume of templated information such as boilerplate language, standardized signatures, replies to calendar invites, etc.
  • Meaning is shrouded in emoticons and insider shorthand that is inconsistent and difficult to normalize over time
  • Discussion threads repeat content over and over, injecting undue weighting to increasingly dated content

This article presents some practical methods and novel visualizations that help to deal with the noisiest of unstructured data. In particular, the data at issue is email that has been scanned as images and then translated back to text through optical character recognition (OCR). Moreover, as in an ediscovery case or knowledge-management solution, the corpus has a known theme that is understood from the outset, and its deficiencies are also well understood (redactions; classified or privileged data missing). The hope is that these methods can take on the challenge of a noisy dataset and more quickly illuminate content that requires further investigation.

The Data: The State Department (“Hillary Clinton”) Emails

Background

During Hillary Clinton’s term as Secretary of State, she and her inner circle largely communicated through an email server located at clintonemail.com. Though not the first official to use a private email server, Clinton’s particular deployment became the subject of scrutiny and, ultimately, an investigation, as it was revealed that material marked “Classified” had been shared or discussed on a server managed and housed separately from normal Department of State secure network operations. As a result of the investigation, the emails were reviewed by Clinton’s lawyers and subsequently released to the public.

Format

Notably, the data, though originating as email, was not released as machine-readable text (EML files). Part of the reason for this was the need for meticulous review and redaction of personal and classified material prior to public release. Instead, the data was released as Acrobat (PDF) files derived from image scanning; the only text available comes from OCR processing of the scans. Therefore, neither structured metadata nor a 100% reliable means of extracting the original text was available to the public at the time of release.

The advantage of using this data is principally the challenge. Since the data has nearly zero internal structure, it helps to demonstrate the process, and the value, of programmatically adding structure and organization to this otherwise chaotic data.

Accessing the Data

The dataset being used has already been extracted as OCR text and collated into separately-released directories with document IDs as file names. Notably, the data is the raw OCR output, with many errors, which helps to serve as an example of how data cleaning and effective processing can improve the overall results.

Download the full archive, covering all releases, from GitHub.

Nature of the Data

The data contains 27,159 text files, each a single email or released document. The text includes various markings, including security classifications and Bates stamping with a case number and a document number for review identification. Some pages have been redacted in full or in part for security reasons. These markings are both helpful and bothersome, as will become evident later, but that’s a reality with real-world data.

Preparing the R Environment

Requisite Libraries

The list of requisite libraries to run this project is as follows:

library(tm)
library(magrittr)
library(ggplot2)
library(ggfortify)
library(ggalt)
library(ggrepel)
library(scales)
library(plyr)
library(dplyr)
library(tidytext)
library(widyr)
library(factoextra)
library(circlize)
library(slam)
library(wordcloud)
library(textclean)
library(mgsub)
library(logging)
library(tidyr)
library(igraph)
library(ggraph)
library(ggthemes)
library(stringr)
library(anytime)
library(dygraphs)
library(lubridate)
library(knitr)
library(kableExtra)

Working With the Raw Data

The data is first read into a VCorpus object, from which the “tm” package can access the document content:

#sourcePath = file.path(".", "SampleData", "HRC-DataHoarder-Github", "HRCEmail_JuneWeb")
sourcePath = file.path(".", "SampleData", "HRC-DataHoarder-Github")
list.dirs(sourcePath)
##  [1] "./SampleData/HRC-DataHoarder-Github"                             
##  [2] "./SampleData/HRC-DataHoarder-Github/Clinton_Email_August_Release"
##  [3] "./SampleData/HRC-DataHoarder-Github/HRC_Email_296"               
##  [4] "./SampleData/HRC-DataHoarder-Github/HRC_NDAS"                    
##  [5] "./SampleData/HRC-DataHoarder-Github/HRCEmail_DecWeb"             
##  [6] "./SampleData/HRC-DataHoarder-Github/HRCEmail_Feb13thWeb"         
##  [7] "./SampleData/HRC-DataHoarder-Github/HRCEmail_Jan29thWeb"         
##  [8] "./SampleData/HRC-DataHoarder-Github/HRCEmail_Jan7thWeb"          
##  [9] "./SampleData/HRC-DataHoarder-Github/HRCEmail_JulyWeb"            
## [10] "./SampleData/HRC-DataHoarder-Github/HRCEmail_JuneWeb"            
## [11] "./SampleData/HRC-DataHoarder-Github/HRCEmail_NovWeb"             
## [12] "./SampleData/HRC-DataHoarder-Github/HRCEmail_OctWeb"             
## [13] "./SampleData/HRC-DataHoarder-Github/HRCEmail_SeptemberWeb"
corpus = VCorpus(DirSource(sourcePath, recursive=TRUE))
corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 27159

There is a large volume of data, so some of the following operations will take a long time to run, especially with insufficient memory or CPU. Experimenting with a single directory may be advisable when making modifications in some areas.
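For faster iteration, a corpus can be built from just one of the release directories listed above. This is a sketch, assuming the same directory layout; swap in whichever release directory suits your experiment:

```r
library(tm)

# Build a smaller corpus from a single release directory while experimenting.
# "HRC_Email_296" is one of the directories listed by list.dirs() above.
smallPath <- file.path(".", "SampleData", "HRC-DataHoarder-Github", "HRC_Email_296")
small.corpus <- VCorpus(DirSource(smallPath, recursive = TRUE))
small.corpus  # far fewer documents than the full 27,159
```

The same DirSource/VCorpus pattern used for the full corpus applies unchanged to any subdirectory.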

Once the data is loaded into the corpus object, any of the documents can be inspected as follows:

doc = corpus[[1]]$content
paste(doc, collapse=" ")
## [1] "      UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05765907 Date: 08/31/2015                                                                                                   RELEASE IN PART B5   From:                             Sullivan, Jacob J <SullivanJJ@state.gov> Sent:                             Friday, December 4, 2009 3:31 AM To: Subject:                          Iran    The EU meets in the coming days, and we are hoping for a strong public - and private - position on Iran. Bill has identified 5 countries that need touching to help drive a good outcome:  I know Huma has discussed with you, but a 2-minute discussion with each that underscores the key points reflected on your card would do the trick, if you can swing it.  Tx. Also, the intervention, with your modifications, turned out well. The process, in this case, did not generate a good • enough product -- I tried to make it clearer and stronger this morning and your amendments helped a lot.           UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05765907 Date: 08/31/2015 \f"

Basic R Tools for Bag of Words Text Mining

Data Preparation

The first step in any text-mining effort is to prepare the data. This involves removing noise such as stop words and numbers, and normalizing the data by setting all text to lower case and “stemming” words to their root forms (“discussing” and “discussion” both become the root “discuss”). Initially, the standard tools are used, but later in the article, more drastic measures are taken to increase the signal-to-noise ratio and to enable more interesting visualizations.

The next set of examples works with 3 sets of the same data, with various levels of processing:

  • “corpus”: The original object that was loaded from the raw data
  • “cleanset”: A copy of the corpus with stop words and numbers removed, and fully stemmed
  • “cleanset.preStem”: A copy of cleanset just before the stemming step, so as to preserve some word endings to make visualization and reporting a little more meaningful.
[R SOURCE: Data Prep Processing]
# The order of operations can become important here; each step has the potential to affect the next. 

cleanset <- corpus
cleanset <- tm_map(cleanset, content_transformer(tolower)) 
cleanset <- tm_map(cleanset, removeNumbers, ucp = FALSE)  
cleanset <- tm_map(cleanset, stripWhitespace) 

#let's load a much longer list of stopwords
english.stopwords.large = as.character(read.csv(file="long-stopwords.txt", header=FALSE)[,1])
cleanset <- tm_map(cleanset, removeWords, english.stopwords.large)

cleanset <- tm_map(cleanset, removePunctuation)

cleanset.preStem <- cleanset
cleanset <- tm_map(cleanset.preStem, stemDocument, language="en") 
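As a quick sanity check, the same pipeline can be run over a tiny in-memory corpus. This is a sketch; the stop-word step is skipped here because it depends on the external long-stopwords.txt file:

```r
library(tm)

# Toy corpus to verify the cleaning and stemming steps behave as expected
toy <- VCorpus(VectorSource(c("Discussing the discussion on 4 December.")))
toy <- tm_map(toy, content_transformer(tolower))
toy <- tm_map(toy, removeNumbers)
toy <- tm_map(toy, removePunctuation)
toy <- tm_map(toy, stripWhitespace)
toy.preStem <- toy                  # preserved copy, as with cleanset.preStem
toy <- tm_map(toy, stemDocument, language = "en")
content(toy[[1]])                   # "discussing" and "discussion" both become "discuss"
```

Note how the Porter stemmer also truncates month names ("december" becomes "decemb"), which is why the pre-stem copy is kept for readable reporting later.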

CorpusSummary: Encapsulation of Basic Text Mining

For the purposes of this article, some of the more basic operations have been encapsulated in a class called “CorpusSummary” that takes as input the cleanset corpus object just created. The source for this class can be downloaded from the related GitHub Project. Once downloaded, include it with the following:

source("corpusSummary.R")  

This is not a formal, available package, but is included as source code with the article to illustrate the overall process. The class performs many key functions on a corpus, returning any of the following information:

  • DocumentTermMatrix: This is a memory-efficient object that, at its most basic, can be used as a simple search tool, cross-referencing every term (word) in cleanset with every document number in which it appears, along with a weighted frequency for the appearance of that word.
  • Sparse DocumentTermMatrix: There are over 100,000 individual terms in the corpus, but the vast majority of those terms appear in only a few documents each. In some cases, a more efficient look at the corpus is possible by creating a sparse matrix, which stores only the most frequently-occurring terms. This speeds up operations, greatly reduces memory consumption, and, most importantly, is the only data needed for many of the follow-on operations.
  • Word Frequency Table: This is a term-only extraction of the DocumentTermMatrix that includes the corpus-wide frequency of each term.
  • Distance Table: This is a cross-reference of each term against all the remaining terms, with a distance measure (default: “euclidian”) for each pair based on the frequency of co-occurrence. In short, it’s a measure of similarity.
  • Kmeans Results: Kmeans is an unsupervised clustering algorithm that is used, in this case, to cluster the terms into related groups. This grouping is used, in turn, to compute summaries and graph results.
  • Cluster Summaries: Each cluster from the Kmeans results is then broken down into a cluster summary consisting of the list of terms representing that cluster, as well as the documents that are representative of that cluster.
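Internally, the class is built from standard tm and stats primitives. The following is a simplified sketch of that pipeline over a toy corpus; the real CorpusSummary class adds stem completion, logging, and summary composition on top of this:

```r
library(tm)

# Sketch of the CorpusSummary pipeline:
# DTM (tf-idf) -> sparse reduction -> term distance matrix -> kmeans clusters
toy <- VCorpus(VectorSource(c("iran sanctions meeting today",
                              "meeting schedule today office",
                              "iran nuclear sanctions talks")))
dtm <- DocumentTermMatrix(toy, control = list(weighting = weightTfIdf))
dtm <- removeSparseTerms(dtm, 0.9)     # drop terms sparser than the maximal
m   <- t(as.matrix(dtm))               # rows = terms
d   <- dist(m, method = "euclidean")   # term-to-term distance
km  <- kmeans(m, centers = 2, nstart = 5)
split(rownames(m), km$cluster)         # terms grouped by cluster
```

On the real corpus, converting the full DTM to a dense matrix would be prohibitive; the sparse reduction is what keeps the distance and clustering steps tractable.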

All of this work is accomplished by simply creating a CorpusSummary object, as follows:

cs = CorpusSummary$new(cleanset, cleanset.preStem, 
                       k.clusters = 5, k.rounds = 5, sparse.maximal = 0.9, 
                       min.words.per.doc = 3)
## 2018-07-29 00:23:29 INFO::PROCESSING CORPUS WTIH DOCUMENTS: 27159
## 2018-07-29 00:23:29 INFO::...creating Matrix...
## 2018-07-29 00:24:00 INFO::......found 84,260 terms...
## 2018-07-29 00:24:00 INFO::...getting stem completions...
## 2018-07-29 00:24:42 INFO::...weighting the matrix...
## 2018-07-29 00:24:43 INFO::...removing sparse terms at maximal of 0.900000...
## 2018-07-29 00:24:45 INFO::......reduced to 109 terms...
## 2018-07-29 00:24:45 INFO::...creating euclidian distance matrix...
## 2018-07-29 00:24:48 INFO::...finding 5 kmeans clusters over 5 rounds...
## 2018-07-29 00:24:48 INFO::...composing summary...
## 2018-07-29 00:24:49 INFO::--- *** Cluster 1: 20,596 Documents, 26 Terms
## 2018-07-29 00:24:50 INFO::--- *** Cluster 2: 12,519 Documents, 29 Terms
## 2018-07-29 00:24:50 INFO::--- *** Cluster 3: 27,153 Documents, 7 Terms
## 2018-07-29 00:24:51 INFO::--- *** Cluster 4: 9,437 Documents, 31 Terms
## 2018-07-29 00:24:53 INFO::--- *** Cluster 5: 13,085 Documents, 16 Terms
## 2018-07-29 00:24:53 INFO::...DONE processing.

Some results to consider:

  • The number of terms can be quite large, sometimes making it seem as if the entire dictionary has been captured. This is not the case. Even after cleaning the data, many individual “terms” are not actually dictionary words. OCR-derived text makes this much worse, since letter recognition is imperfect, especially where the OCR engine found no dictionary match.
  • After reducing the results to a sparse matrix, there are usually only a few dozen terms left, seldom more than 150. This is typical.
  • Each cluster contains the documents from the sparse results that have at least a minimum number of that cluster’s words in them. Because the k-means results are based on word distance (similarity), the documents in each cluster may overlap, so the document counts across all clusters may sum to more than the total number of documents.

Using the CorpusSummary Class

Constructor Arguments

The constructor arguments are as follows:

  • Corpus (required) - Any corpus may be supplied, but the class assumes a pre-processed corpus: all lower case, unwanted stop words removed, and stemmed.
  • Pre-stemmed Corpus (optional) - This is a copy of the corpus just before the stemming step so as to capture the word endings for later re-stitching.
  • k.clusters (optional, default=5) - The number of cluster groups to create. Usually, this is a small handful, even with a large corpus.
  • k.rounds (optional, default=5) - This is the number of iterations of the kmeans algorithm that will be processed. The results quickly reach a point of diminishing returns, probably in the area of 5-10 iterations.
  • method (optional, default=“euclidian”) - the distance matrix method. See “?dist” documentation for more information.
  • sparse.maximal (optional, default=.8) - Specifies how much of the corpus to include after removing “sparse” terms.

Getters

Here are the getters to retrieve the individual results from a CorpusSummary object “cs”.

getCorpus()

Get the original corpus that was processed:

c = cs$getCorpus()
getDist()

Get a distance matrix of type “dist”:

d <- cs$getDist()
getDTM()

There are two ways to get the DocumentTermMatrix. The default returns only the most relevant terms; setting the “sparse” argument to FALSE returns the full set of terms (minus stop words).

dtm <- cs$getDTM() #sparse = TRUE
dtm 
## <<DocumentTermMatrix (documents: 27159, terms: 109)>>
## Non-/sparse entries: 651229/2309102
## Sparsity           : 78%
## Maximal term length: 21
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
#and now the full index, which you'll see is much larger
dtm.full <- cs$getDTM(sparse=FALSE) # get the full DocumentTermMatrix
dtm.full
## <<DocumentTermMatrix (documents: 27159, terms: 84260)>>
## Non-/sparse entries: 2717676/2285699664
## Sparsity           : 100%
## Maximal term length: 136
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
getMostFrequentTerms()

Get most frequent terms from full or sparse processing:

wf = cs$getMostFrequentTerms(sparse = TRUE)
head(wf[!wf$word %in% rownames(wf),], 10) # true is default
##                word     freq
## mill          mills 338.6136
## messag      message 316.4474
## origin     original 292.0921
## secretari secretary 224.9036
## januari     january 220.0542
## septemb   september 216.5710
## juli           july 207.4680
## offic        office 190.3231
## meet        meeting 187.5624
## octob       october 186.6647

Each row name shows the stem of the word, while the word displayed shows a possible stem completion. The completion is not perfect, but for display purposes it is better than the stem, which can be terse.
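This re-stitching can be sketched with tm’s stemCompletion against a small dictionary (invented here; the class draws its dictionary from the pre-stem corpus). One caveat worth knowing: Porter’s y-to-i rewrite means “secretary” stems to “secretari”, which defeats simple prefix matching, one reason a purpose-built completion table helps:

```r
library(tm)

# Complete terse stems back into readable words via prefix matching.
dictionary <- c("meeting", "message", "original", "office")
stems      <- c("meet", "messag", "origin", "offic")
stemCompletion(stems, dictionary = dictionary, type = "prevalent")
# meet -> meeting, messag -> message, origin -> original, offic -> office

# The y-to-i pitfall: "secretary" does not begin with "secretari",
# so a plain prefix match finds no completion here.
stemCompletion("secretari", dictionary = c("secretary"))
```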

getStemCompletionTable()

Get the list of the most frequent stem completion for each stem:

sct <- cs$getStemCompletionTable()
diffs <- which(as.character(sct$stem) != as.character(sct$completion) & (sct$stem %in% rownames(sct) ))
head(sct[diffs,], 10)
##                  stem   completion
## depart         depart   department
## unclassifi unclassifi unclassified
## releas         releas      release
## messag         messag      message
## origin         origin     original
## secretari   secretari    secretary
## mill             mill        mills
## presid         presid    president
## govern         govern   government
## offic           offic       office

This data is saved and reused often to create more human-readable output in the case of plots, etc.
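As an illustration, a small helper (hypothetical; not part of CorpusSummary) can use this table to relabel stems for display, falling back to the stem when no completion is known:

```r
# Hypothetical helper: relabel stems using a stem-completion lookup table.
prettyTerms <- function(stems, sct) {
  i   <- match(stems, as.character(sct$stem))
  out <- as.character(sct$completion)[i]
  out[is.na(out)] <- stems[is.na(out)]   # fall back to the stem itself
  out
}

sct <- data.frame(stem       = c("secretari", "offic"),
                  completion = c("secretary", "office"))
prettyTerms(c("secretari", "offic", "iran"), sct)
# "secretary" "office" "iran"
```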

getTermsFromDoc()

Given any single document number, retrieve the terms from that document:

head(cs$getTermsFromDoc(doc.number = 1), 10)
##                         word       freq
## iran                    iran 0.14603876
## clearer              clearer 0.14237293
## trick                  trick 0.13903632
## swing                  swing 0.12670726
## card                    card 0.11368405
## stronger            stronger 0.11274416
## intervention interventionism 0.09594469
## help                    help 0.08952631
## discuss              discuss 0.08924969
## good                    good 0.08873535
getKmeansResults()

A k-means clustering has already been run on the corpus; its results are used to compose an interesting map of the corpus data in a later step.

kr <- cs$getKmeansResults()
getSummary()

The getSummary method returns a list with one element per k-means cluster; each element gives the terms selected for that cluster and the list of documents meeting the minimum word-occurrence threshold.

s <- cs$getSummary()
head(s[[5]]$docList)
## [1] 16905  5856 16904 17820 23107 23110
head(s[[5]]$termList,10)
##                              word     freq
## messag                    message 316.4474
## origin                   original 292.0921
## hrodclintonemail hrodclintonemail 245.9795
## full                         full 232.1467
## secretari               secretary 224.9036
## millscdstategov   millscdstategov 195.8275
## meet                      meeting 187.5624
## hdrclintonemail   hdrclintonemail 161.7180
## today                       today 154.3978
## email                       email 140.8304

Difficulties for Raw Data Visualization

Ultimately, presenting data to users involves data visualization. Many data visualization techniques, as shown below, suffer from one of these issues:

  • The amount of data is too dense for the layout, making it hard to read
  • The data is too “flat”; has no dimension to help with readability
  • The data is too cluttered with noise
  • The visualization is poorly designed to deliver the intended result

The visualizations below quickly demonstrate the problems encountered with raw-data visualization, in particular their relative inability to deliver insight. The remainder of the article focuses largely on incrementally overcoming these deficiencies.

[R SOURCE: theme_safari() function used in visualizations]
theme_safari = function() {
    wb = element_rect(fill="white")
    return(theme_fivethirtyeight() + 
               theme( legend.background = wb,
                      legend.key = wb,
                      plot.background = wb,
                      panel.background = wb,
                      panel.grid.major = element_blank(),
                      axis.title.x = element_blank(),
                      axis.text = element_text(colour="white"),
                      axis.ticks.x = element_blank(),
                      axis.ticks.y = element_blank(),
                      plot.title = element_text(hjust=0.5)
                      ))
}

Term Frequency Bar Chart

Every visualization suite must have at least one bar graph (or pie chart, which generally should be avoided):

[R SOURCE: Term Frequency Bar Chart]
pal <- brewer.pal(6,"Dark2")[-(1)]

ggplot(head(wf, 35), aes(x = reorder(word, freq), y = freq)) + 
  geom_bar(stat = "identity", fill=pal[2], show.legend=F) +
  geom_text(aes(y=0, label=word), hjust=0, nudge_y=10, fontface="bold", color="white") +
  coord_flip() +
  ggtitle("Relative Word Frequencies for HRC Dataset") + 
  labs(y = "Occurrences", x = "Words by Rank") +
  theme_safari() + 
  theme(axis.text.y = element_blank(), axis.text.x = element_blank(),
        plot.title = element_text(size="15", face="bold", hjust=.5))

Cluster Map based on Term Correlations

Some form of cluster map is common, and it’s tempting to see what one will produce. In this unprocessed visualization, the data and the layout are just so much noise. Even with typical stop words removed, it reveals no actionable content. Most notably, the terms are split such that Cheryl Mills’s name is not even in the same cluster as her email address, which, while mathematically correct, isn’t very useful information in the context of evaluating the corpus.

[R SOURCE: Simple Kmeans / Distance Plot using fviz_cluster()]
fviz_cluster(cs$getKmeansResults(), cs$getDist(completions = TRUE), 
             main="Clustering: Single-word Co-occurrence by Document",
             repel=TRUE, labelsize=10, pointsize=2, 
             ellipse.alpha=.15, 
             xlab = FALSE, ylab = FALSE,
             ggtheme=theme_safari() 
             )

Improving Bag of Words with Normalization

Most data analysis efforts of any consequence involve normalization. There are many confounding variables that will skew a purely generic approach. In this example, the data is OCR output from email, and it comes with several associated problems that make generic approaches, suited to basic examples (e.g. novels), far less effective. Modern email is almost a language unto itself, and when terms are further skewed by OCR errors, more work has to be done.

Basic Normalization Techniques

While the data has been run through the out-of-the-box cleaning, more work is needed to further normalize it. To start, create a matrix to hold some upcoming regular expressions with their replacement values. Getting rid of everyday email clutter helps the meaningful data signals come through in the k-means analysis and further visualization efforts.

subs <- matrix(nrow=0, ncol=2)

Dates and Times

Email like this tends to include a lot of date/time and planning information, which has skewed our results. Unless such content is key to understanding the corpus (it usually isn’t), it should be removed or, as here, “normalized” into a meaningful token (“DAY” or “MONTH”). The language is assumed to be English; a commercial application would, of course, need more sophisticated language detection.

subs <- subs %>%
          rbind(c("Saturday|Sunday|Monday|Tuesday|Wednesday|Thursday|Friday|Tomorrow|Tommorrow|Yesterday", "DAY")) %>%
          rbind(c("January|February|March|April|May|June|July|August|September|October|November|December", "MONTH")) %>%
          rbind(c("CALENDAR ENTRY:", "CALENDAR"))

Some Oddities of Email and Scanned Images

This dataset has some additional noise that is particularly bothersome in further analysis:

  • Replies sometimes come with a moniker “—– Original Message —-”.
  • Each document has a header and footer that identifies it as part of the release set. This noise is skewing our results considerably!
  • There is stray punctuation, such as em-dashes and bullets, that isn’t handled by the APIs.

All of these are removed.

subs <- subs %>%
          rbind(c("(UNCLASSIFIED U.?S.? Department of State Case No.? F[0-9\\-]{10,12} Doc No?)|[—•]|Original Message|((Sent: )*(\\w+day), (Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?) \\d+, \\d+ \\d+:\\d+ (AM|PM))", " ")) 

Normalizing Corpus-Specific Data – Hillary Clinton

In this corpus, it would be helpful to better understand the association of key players in the text. Like email in any organization, corpus-specific abbreviations are troublesome. In particular, Hillary Clinton has many abbreviations or alternatives:

  • “HRC”
  • “Mrs. Clinton”
  • “H” and “H2”

The below expression attempts to replace all of these with her email address as a unique identifier. This will be changed again later on.

subs <- subs %>%
          rbind(c("(\\sHRC\\s)|(^(To:|From:|CC:)\\s*H2?$)|(Mrs.?\\s*Clinton)|(\\sH\\scalled)", "hrod17@clintonemail.com")) 

This kind of replacement is difficult to generalize. It is tricky to avoid false positives, which, depending on the end process, could be benign or could lead to ugly, unintended consequences.
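A quick way to check for false positives is to run the pattern against a few hand-built positive and negative examples. This is a sketch; the sample strings are fabricated purely for illustration:

```r
# Spot-check the HRC substitution pattern against invented examples.
patt <- "(\\sHRC\\s)|(^(To:|From:|CC:)\\s*H2?$)|(Mrs.?\\s*Clinton)|(\\sH\\scalled)"
should.match     <- c("told HRC today", "From: H", "Mrs. Clinton asked", "then H called me")
should.not.match <- c("SEARCH RESULTS", "From: Holbrooke", "the H1N1 outbreak")
grepl(patt, should.match,     perl = TRUE)   # all TRUE
grepl(patt, should.not.match, perl = TRUE)   # all FALSE
```

Keeping such examples alongside the substitution table makes it cheap to re-verify the patterns whenever they are tightened or extended.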

Error-Correcting Optical Character Recognition

The OCR process introduces errors of all kinds, some of which bias the data greatly. As an example, in this corpus the copyright (©) symbol was detected where an at sign (@) was the correct choice; a completely understandable and easily remedied error. Another fun one is Lona Valmoro, whose email ValmoroLJ@state.gov confuses the OCR into rendering the letters “LJ” as a variety of combinations, including “U”, “l3”, and even “.]”, due to the lack of dictionary fix-up capabilities here. Cleaning up this data is essential if Lona Valmoro is a subject in our review of documents. There are techniques for fixing these after the fact, the most common being to allow some wiggle room in the number of characters and for spelling errors in important places. In the example below, additional leeway is given for whether “.com” might be followed by a stray character, a common occurrence.

These fix-ups are done only against some of the more important people and email addresses, though it could be generalized more broadly, at a processing cost.

subs <- subs %>%
          rbind(c("©state.gov[eil1]?",                                    "@state.gov")) %>%
          rbind(c("©clintonemail.com[eil1]?",                             "@clintonemail.com")) %>%
          rbind(c("abedin[[:alpha:]\\(]{1,3}s[tl]ate.gov",                "abedinh@state.gov")) %>%
          rbind(c("hdr22[[:alpha:]\\(]{1,3}clintonemail.com",             "hdr22@clintonemail.com")) %>%
          rbind(c("millscd[[:alpha:]\\(]{1,3}state.gov",                  "millscd@state.gov")) %>%
          rbind(c("sulliva[3jnrial]{3}[©@p]state.gov|Jake.Sulli[vy]an",   "sullivanjj@state.gov")) %>%
          rbind(c("[1ijf]?valmoro[lju1i©@.\\3\\]]{1,4}state.gov[iel1]?", "valmorolj@state.gov"))

Augmenting Normalization through External Data Sources

One way to normalize is to turn the principal players into unique terms, as opposed to name and email-related terms spread all over the bag of words. This greatly consolidates the clustering job, and begins to illuminate information that wasn’t visible before. It is accomplished through some relatively intense regular-expression work. The first step is to load a directory of key players along with their names and up to two email addresses. Each is then associated with a unique single-term moniker. So, for example, Hillary Clinton is now “HILLARYCLINTON” and Huma Abedin is now “HUMAABEDIN”.

directory <- read.csv(file="directory-data.csv") 
Moniker          First    Last      Email1
CHERYLMILLS      Cheryl   Mills     millscd@state.gov
HILLARYCLINTON   Hillary  Clinton   hrod17@clintonemail.com
JAKESULLIVAN     Jacob    Sullivan  sullivanjj@state.gov
HUMAABEDIN       Huma     Abedin    abedinh@state.gov
LAURENJILOTY     Lauren   Jiloty    JilotyLC@state.gov
LONAVALMORO      Lona     Valmoro   ValmoroLJ@state.gov
MONICAHANLEY     Monica   Hanley    HanleyMR@state.gov

Using the directory, a similar regular-expression based search and replace is formed to turn all of these instances into single-term tokens. In this case, a human-readable moniker is used for clarity, but it would be possible to use a unique hash that could be used as a lookup into the directory as well.

[R SOURCE: Compose Moniker Replacement Regular Expressions]
d <- as.matrix(directory)
replace.monikers <- matrix(nrow=0, ncol=2, dimnames=list(NULL,c("pattern","replacement")))
for (r in 1:nrow(d)) {
    e2 <- d[r,"Email2"]
    if (nchar(e2)>0) {
        e2 <- paste("|", e2, sep="")
    } else {
        e2 <- ""
    }
    
    if (nchar(d[r, "First"]) > 0 && nchar(d[r, "Last"]) > 0) {
        patt <- sprintf("(%s\\s?(%s\\s?)?%s)|(((%s,\\s*%s\\s*(%s)?\\s*)?((\\<\\s*|\\[mailto:)?(%s%s)))(\\s*[\\]\\>])?)|(%s,\\s*%s(\\s?%s)?)",
            d[r, "First"], d[r, "Middle"], d[r, "Last"],
            d[r, "Last"], d[r, "First"], d[r, "Middle"], d[r, "Email1"], e2,
            d[r, "Last"], d[r, "First"], d[r, "Middle"])
        if (nchar(d[r,"Nickname"])>0) {
            patt <- paste(patt, sprintf("|(%s[\\s.]*%s)", d[r, "Nickname"], d[r, "Last"]), sep="")
        }
            
    } else {
        patt <- sprintf("(\\<\\s*|\\[mailto:)*(%s%s)(\\s*[\\]\\>])*",
                d[r, "Email1"],
                e2)    
    }
    
    # add a space on both sides of the moniker to ensure it gets termed correctly
    replace.monikers <- replace.monikers %>% rbind(c(patt, paste0(" ", d[r, "Moniker"], " ")))
}

Processed Document

Most of the email addresses are now gone and replaced by the monikers:

[R SOURCE: Test Moniker Replacement, Single Document]
doc1 <- corpus[[12]]$content
rm <- replace.monikers
rm[,2] <- paste("<mark>", rm[,2], "</mark>", sep="")
doc1p <- mgsub(doc1, subs[,1], subs[,2], ignore.case=T, perl=T, fixed=F) %>%
            mgsub(rm[,1], rm[,2], ignore.case=T, perl=T, fixed=F)

cat("<pre><code class='h1js'>", paste(doc1p, collapse=" "), "</code></pre>")
        . C05765938 Date: 08/31/2015                                                                                                 RELEASE IN PART B5   From:                              JAKESULLIVAN Sent:                              DAY, MONTH 6,2009 11:00 AM HILLARYCLINTON Subject:                           Re: Jake-    We could reach out in the next couple of hours or alternatively early tomorrow. The Eikenberry points sent last night, plus Holbrooke's add'n, provide a good roadmap.     Let us know how you'd like to proceed.          From: H HILLARYCLINTON To: JAKESULLIVAN i Sent: Sun Dec 06 10:33:09 2009 Subject: Re: Jake--  When can I make the Karzai call?        From: JAKESULLIVAN HILLARYCLINTON Cc: HUMAABEDIN Sent: Sun Dec 06 10:30:03 2009 Subject: Fw: Jake--  FYI          From: Holbrooke, Richard To: JAKESULLIVANi Sent: Sun Dec 06 08:06:33 2009 Subject: Jake--  H called me DAY when I was out of pocket. Available now if she still wants to talk.                Thanks, R            . C05765938 Date: 08/31/2015  

Original

Here is the raw content from the corpus:

## [1] "      UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05765938 Date: 08/31/2015                                                                                                 RELEASE IN PART B5   From:                              Sullivan, Jacob J <SullivanJJ@state.gov> Sent:                              Sunday, December 6,2009 11:00 AM To:                                H Subject:                           Re: Jake-    We could reach out in the next couple of hours or alternatively early tomorrow. The Eikenberry points sent last night, plus Holbrooke's add'n, provide a good roadmap.     Let us know how you'd like to proceed.        Original Message From: H <HDR22@clintonemail.com> To: Sullivan, Jacob i Sent: Sun Dec 06 10:33:09 2009 Subject: Re: Jake--  When can I make the Karzai call?      Original Message From: Sullivan, Jacob J <Sullivanii@state.gov> To: H Cc: Abedin, Huma <AbedinH@state.gov> Sent: Sun Dec 06 10:30:03 2009 Subject: Fw: Jake--  FYI        Original Message From: Holbrooke, Richard To: Sullivan, Jacobi Sent: Sun Dec 06 08:06:33 2009 Subject: Jake--  H called me yesterday when I was out of pocket. Available now if she still wants to talk.                Thanks, R           UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05765938 Date: 08/31/2015 \f"

Reprocess the Corpus With Enhanced Normalization

The previously prepared regular expressions are now part of the cleaning regimen, from which more interesting visualizations can be produced. Before that, the CorpusSummary is re-computed, this time after applying the substitutions discussed above.

[R SOURCE: Data Cleaning Regimen with Additional Normalization]
cleanset <- corpus 
cleanset <- tm_map(cleanset, content_transformer(textclean::mgsub), pattern=subs[,1], replacement=subs[,2], ignore.case=T, perl=T, fixed=F)

cleanset <- tm_map(cleanset, stripWhitespace) 
cleanset <- tm_map(cleanset, content_transformer(textclean::mgsub),
                   pattern=replace.monikers[,1], replacement=replace.monikers[,2],
                   ignore.case=T, perl=T, fixed=F, leadspace=TRUE, trailspace=TRUE)

cleanset <- tm_map(cleanset, content_transformer(tolower)) 
cleanset <- tm_map(cleanset, removeNumbers, ucp = FALSE)  
cleanset <- tm_map(cleanset, removeWords, words=english.stopwords.large)
cleanset <- tm_map(cleanset, removePunctuation)

cleanset.preStem <- cleanset
cleanset <- tm_map(cleanset.preStem, stemDocument, language="en") 
cs2 <- CorpusSummary$new(cleanset, cleanset.preStem, 
                       k.clusters = 4, k.rounds = 5,
                       sparse.maximal = 0.92, 
                       min.words.per.doc = 3)
## 2018-07-29 00:36:10 INFO::PROCESSING CORPUS WTIH DOCUMENTS: 27159
## 2018-07-29 00:36:10 INFO::...creating Matrix...
## 2018-07-29 00:36:55 INFO::......found 81,575 terms...
## 2018-07-29 00:36:55 INFO::...getting stem completions...
## 2018-07-29 00:37:54 INFO::...weighting the matrix...
## 2018-07-29 00:37:54 INFO::...removing sparse terms at maximal of 0.920000...
## 2018-07-29 00:37:58 INFO::......reduced to 127 terms...
## 2018-07-29 00:37:58 INFO::...creating euclidian distance matrix...
## 2018-07-29 00:38:04 INFO::...finding 4 kmeans clusters over 5 rounds...
## 2018-07-29 00:38:04 INFO::...composing summary...
## 2018-07-29 00:38:05 INFO::--- *** Cluster 1: 1,603 Documents, 8 Terms
## 2018-07-29 00:38:06 INFO::--- *** Cluster 2: 11,767 Documents, 22 Terms
## 2018-07-29 00:38:06 INFO::--- *** Cluster 3: 26,635 Documents, 47 Terms
## 2018-07-29 00:38:08 INFO::--- *** Cluster 4: 15,204 Documents, 50 Terms
## 2018-07-29 00:38:08 INFO::...DONE processing.

Enhanced Basic Visualizations

Viewing the Tagged Content

Now that key people are tagged, the tagged content can be extracted directly from the summary statistics:

[R SOURCE: Relative Frequency of Identified Monikers]
wf <- cs2$getMostFrequentTerms(sparse = FALSE) 
mlist <- 
  data.frame( moniker=directory$Moniker,
              displayName=paste(directory$First, directory$Last, sep=" "),
              freq=(wf[stemDocument(tolower(directory$Moniker), language="en"),"freq"]),
              stem=tolower(directory$Moniker),
              row.names="stem",
              stringsAsFactors = F
)

ggplot(mlist, aes(x = reorder(displayName, freq), y = freq)) + 
  geom_bar(fill=pal[2], stat = "identity", show.legend=F) +
  geom_text(aes(label=displayName, y=0), hjust=0, nudge_y=10, 
            fontface="bold", color="white") +
  coord_flip() +
  ggtitle("Key Persons Relative Occurrences as Terms") + 
  theme_safari() + 
  theme(axis.text.y = element_blank(), axis.text.x = element_blank(),
        plot.title = element_text(size="15", face="bold", hjust=.5))

Tuning Stop Words to the Corpus

In the next set of illustrations, it helps to understand that the most frequent terms are not always the most interesting. This is the idea behind TF-IDF (Term Frequency-Inverse Document Frequency). A great many words that surface in visualizations are of the frequent-but-uninteresting type. In this corpus, at least one of the terms below appears an average of 14 times per document. Keeping them in for statistical reasons, but removing them from visualizations, may help to boost signal with a minimum of effort.
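To make the TF-IDF intuition concrete, here is a minimal sketch using the tm package on a toy three-document corpus (the corpus and term choices are illustrative only, not drawn from the email data):

```r
library(tm)

# Toy corpus: "schedule" appears in every document, "benghazi" in one
docs <- VCorpus(VectorSource(c(
    "schedule meeting today benghazi",
    "schedule meeting tomorrow",
    "schedule call next week")))

raw   <- TermDocumentMatrix(docs)
tfidf <- TermDocumentMatrix(docs, control = list(weighting = weightTfIdf))

# "schedule" is the most frequent term, yet its TF-IDF weight is zero,
# because a term found in every document carries no discriminating power
as.matrix(raw)["schedule", ]
as.matrix(tfidf)["schedule", ]
```

Frequent-but-uninteresting terms like these are exactly what the enhanced stop word list below targets.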

[R SOURCE: Enhanced Stop Word List]
email.form.terms <- c("subject", "cc", "to", "bcc", "sent", "mon", "tue", "wed", "thu", "fri", "sat", "sun", "jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "oct", "nov", "dec")  

planning.words = c("today", "tomorrow", "anytime", "morning", "noon", "afternoon", "availability", "schedule", "time", "week", "day", "month", "year", "meetings", "meeting", "meet", "release", "pls", "good", "work", "will", "well", "going", "told", "email", "full", "fw", "fyi", "talk", "am", "pm", "shuttle", "udpate", "btw", "calls", "print", "fax", "list", "minutes", "asap", "drive", "plane", "flight", "connect", "todayday", "wrote", "residence", "room", "thx", "forward", "set", "finish", "checking", "depart", "departs", "departing", "arrive", "check", "route", "copy", "cscc", "copying", "messages", "attached", "mailto", "emailed", "confirmed", "mtg", "tonight", "tonite", "talked", "scheduled", "copies", "trip", "talk", "talking", "offered", "sending", "statement", "sbu")

corpus.stopwords = sort(unique(c(email.form.terms, planning.words)))

print(corpus.stopwords, quote=FALSE)
##   [1] afternoon    am           anytime      apr          arrive      
##   [6] asap         attached     aug          availability bcc         
##  [11] btw          calls        cc           check        checking    
##  [16] confirmed    connect      copies       copy         copying     
##  [21] cscc         day          dec          depart       departing   
##  [26] departs      drive        email        emailed      fax         
##  [31] feb          finish       flight       forward      fri         
##  [36] full         fw           fyi          going        good        
##  [41] jan          jul          jun          list         mailto      
##  [46] mar          meet         meeting      meetings     messages    
##  [51] minutes      mon          month        morning      mtg         
##  [56] noon         nov          oct          offered      plane       
##  [61] pls          pm           print        release      residence   
##  [66] room         route        sat          sbu          schedule    
##  [71] scheduled    sending      sent         sep          set         
##  [76] shuttle      statement    subject      sun          talk        
##  [81] talked       talking      thu          thx          time        
##  [86] to           today        todayday     told         tomorrow    
##  [91] tonight      tonite       trip         tue          udpate      
##  [96] wed          week         well         will         work        
## [101] wrote        year

Removing these additional noisy stop words pays off, as the comparison word clouds below demonstrate. Noise in the dataset is particularly amplified in more complex analytics and visualizations. By combining the normalization with this additional focus, many visualizations become more engaging. The word cloud on the right includes additional important terms that were crowded out by the mundane terms (shown in gray) on the left.

[R SOURCE: Before and After Word Clouds]
getWordCloudData = function(wf.table, nterms=50, wf.replace = NULL, stopwords=NULL, markwords=NULL) {
    w <- wf.table
    if (!is.null(stopwords)) {
        w <- head(w[-which(w$word %in% stopwords),], nterms)
    } else {
        w <- head(w, nterms)
    }

    w <- data.frame(
          word       = as.character(w$word), 
          freq       = w$freq, 
          isMarked   = ifelse(!is.null(markwords) & (as.character(w$word) %in% markwords), TRUE, FALSE),
          isReplaced = FALSE,
          row.names  = as.character(w$word), 
          stringsAsFactors = FALSE
    )
    
    if (!is.null(wf.replace)) {
        mx <- which(w$word %in% rownames(wf.replace))
        w[mx, "word"] <- wf.replace[w[mx,"word"], "displayName"]
        w[mx, "isReplaced"] <- TRUE
    }
    
    return(w)
}

pal <- brewer.pal(6,"Dark2")
pal <- pal[-(1)]

wfc <- getWordCloudData(wf, nterms=100, markwords = corpus.stopwords)

wfc2 <- getWordCloudData(wf, wf.replace=mlist, nterms=100, stopwords=corpus.stopwords)
wfc2$isMarked <- ifelse(wfc2$isReplaced | (wfc2$word %in% wfc$word), FALSE, TRUE)

par(mfrow=c(1,2))

wordcloud(wfc$word, wfc$freq, scale=c(3.5,1), 
          random.order=FALSE, random.color=FALSE, rot.per = .25, ordered.colors=TRUE,
          colors=c("DarkGray", pal)[as.integer(!wfc$isMarked) + 1 + wfc$isReplaced])

wordcloud(wfc2$word, wfc2$freq, scale=c(3.5,1), ordered.colors=TRUE,
          random.order=FALSE, random.color=FALSE, rot.per = .25, 
          colors=c("Black", pal)[as.integer(!wfc2$isMarked) + 1 + wfc2$isReplaced])

Enhanced Cluster Diagram

The cluster diagram also begins to show real promise. Instead of learning the association of “Cheryl” to “Mills”, the association is now between principal players in a social network. Visualizations can now safely highlight key players, or otherwise reorganize the data to collate it effectively. Note that with this normalization and level of cleaning, four groupings may be more than is necessary, but again, this is subjective; what matters is that the groupings tend to show incrementally more signal as the data is further refined. This is just one incremental step of many that starts to improve the overall usefulness of the analytics and the related visualizations.

[R SOURCE: Enhanced Kmeans Clustering w/ High Relevancy Terms]
getPlottableKmeansData = function(kmeans, dist) { 
    #code is adapted and customized from factoextra package for better control
    p.ind <-  scale(dist) %>%
              stats::prcomp(scale = FALSE, center = FALSE) %>%
              facto_summarize(element = "ind", result = "coord")
    colnames(p.ind)[2:3] <- c("x", "y")
    p.data <- cbind.data.frame(p.ind, 
                               cluster = as.factor(kmeans$cluster[as.character(p.ind$name)]),
                               stringsAsFactors=FALSE)

    return(p.data)
}

p.data <- getPlottableKmeansData(cs2$getKmeansResults(), cs2$getDist()) %>%
            filter(!as.character(name) %in% corpus.stopwords)
p.data$name <- cs2$getStemCompletionTable()[as.character(p.data$name),"completion"]
p.repl <- which(p.data$name %in% rownames(mlist))

p.clgroups <- by(p.data[-p.repl,], p.data[-p.repl,"cluster"], FUN = (
    function(c) {
        labels <- str_wrap(paste(head(c[order(-c$coord), "name"], 20), collapse=", "), width=1)

        cg <- data.frame(
            x = mean(c$x), y = mean(c$y),
            cluster = as.factor(c$cluster[1]),
            isMoniker = FALSE,
            label = labels,
            labelSize = 5,
            force=25,
            fontface="plain",
            stringsAsFactors = FALSE
        )
        
        return(cg)
    }
  )) %>%
  rbind.fill() %>%
  rbind(data.frame(
      x = p.data$x[p.repl], 
      y = p.data$y[p.repl],
      cluster = p.data$cluster[p.repl],
      isMoniker = TRUE,
      labelSize = 6,
      fontface="bold",
      force=200,
      label = mlist[as.character(p.data$name[p.repl]), "displayName"],
      stringsAsFactors = FALSE
  ))

ggplot(p.data, aes(x=x, y=y)) +
        geom_encircle(data=p.data, show.legend=FALSE, inherit.aes = T, aes(fill=cluster), alpha=.25, expand=.05) +
        geom_text_repel(data=p.clgroups[p.clgroups$isMoniker,], 
                        aes(x=x, y=y, 
                            label=str_wrap(label, width=30), 
                            color=cluster, size=labelSize, fontface=fontface),
                            direction="y", force=10, show.legend = FALSE) +
        geom_point(data=p.data[p.repl,], 
                   aes(x=x, y=y, color=cluster), size=4, alpha=.5) +
        geom_text_repel(data=p.clgroups[!p.clgroups$isMoniker,], 
                        aes(x=x, y=y, 
                            label=str_wrap(label, width=30), 
                            color=cluster, size=labelSize, fontface=fontface),
                            # ylim=c(mean(p.clgroups[!p.clgroups$isMoniker,"y"]), max(p.clgroups$y)),
                            direction="both", force=55, show.legend = FALSE,
                            lineheight=.8, vjust=.2, segment.alpha = .5) +
        geom_point(data=p.data[-p.repl,], 
                   aes(x=x, y=y, color=cluster), size=1) +
        geom_point(data=p.clgroups[!p.clgroups$isMoniker,], 
                   aes(x=x, y=y, color=cluster), size=4) +
        scale_x_continuous(expand = c(.2, .2)) +
        scale_y_continuous(expand = c(.2, .2)) +
        scale_size_continuous(range = c(5,8)) +
        scale_color_discrete(name="Cluster") +
        ggtitle("Clustered Terms by Association") + 
        theme_safari()

Cluster Summaries

The cluster summary data from this round can be used in a variety of ways, including producing a navigable UI. The enhanced stop words were kept in the statistics, so they still appear in this report.

[R SOURCE: Composing a Summary Report from Cluster Output]
s <- cs2$getSummary()
s.terms <- lapply(1:length(s), FUN= (function(i) {
      mon <- tolower(mlist$moniker)
      m <- which(s[[i]]$termList$word %in% mon)
      t <- which(!s[[i]]$termList$word %in% mon) 
      p <- paste(gsub("\\s", "&nbsp;", mlist[as.character(s[[i]]$termList$word[m]), "displayName"]), collapse=", ")
      r <- data.frame(
              Cluster=i, 
              Matching=length(s[[i]]$docList), 
              People=p,
              Terms=paste(s[[i]]$termList$word[t], collapse=", "), 
              stringsAsFactors = F)
      return(r)
  })) %>%
  rbind.fill()
Cluster Matching People Terms
1 1,603 Cheryl Mills, Jacob Sullivan, Huma Abedin talk, tomorrow, schedule, calls, statement
2 11,767 Hillary Clinton full, secretary, department, meeting, office, today, email, fyi, report, security, house, morning, cheryl, clinton, david, attachments, point, message, speak, question, thought
3 26,635 government, well, people, support, national, country, public, washington, source, united, senior, american, policy, group, personal, international, including, subject, move, political, leaders, forces, position, change, military, close, breaking, appears, direct, power, interest, general, members, response, original, efforts, continue, clear, service, open, concerned, provide, long, remain, day, month, release
4 15,204 will, state, work, time, president, discuss, news, good, best, week, minister, send, officials, issues, hope, sure, foreign, confidential, year, help, reason, great, plan, assistance, asked, note, staff, press, conference, told, forward, ambassador, things, hillary, reach, case, wanted, set, visit, development, agreed, election, private, start, affairs, received, head, find, post, number

There’s a vague sense of some topics. In the next section, N-grams are used to find more interesting terms.

Summary of Extended Bag of Words Processing

There are two important takeaways here:

  • Signal strength is improved by processing the data smartly based on context. Specifically, this means removing the noise of email (days, weeks, calendars, planning), and fixing some pervasive OCR errors that are specific to scanned email (such as the confusion between the at sign ‘@’ and the copyright symbol ‘©’).
  • Analytics and visualizations are improved by normalizing character sequences that identify persons from an external directory, which noticeably improves the cluster relevance and the resulting visualization.

This level of data processing, while requiring significant up-front work, can be productized, and improves the value of existing analytics techniques and visualizations.
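As a concrete illustration of the OCR repair mentioned above, here is a minimal sketch of the ‘©’-for-‘@’ substitution (the exact pattern used in the cleaning regimen may differ; this one replaces the symbol only when it is flanked by word characters, as in an email address):

```r
# OCR commonly misreads the at sign '@' as the copyright symbol '©' in
# scanned email; replace it only between word characters so that
# legitimate copyright notices are left alone
fix.ocr.at <- function(x) {
    gsub("(\\w)©(\\w)", "\\1@\\2", x, perl = TRUE)
}

fix.ocr.at("From: Sullivan, Jacob J <SullivanJJ©state.gov>")
## [1] "From: Sullivan, Jacob J <SullivanJJ@state.gov>"
```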

Identifying More Interesting Associations with N-Grams

Heading deeper into the data, the “tidytext” package is deployed to extract n-grams: multi-word sequences that occur frequently in a text and, as such, are more likely to represent meaningful relationships. The following visualization is adapted from the online site for the (quite excellent) book “Text Mining with R” and shows how tidytext can be used to build a fairly robust visualization of n-grams.

This basic visualization simply connects the more frequent N-grams in a graph, showing clusters of N-grams that share individual terms. In this case only bigrams are used, but longer sequences could also be identified, with diminishing returns above N=3.

[R SOURCE: Identifying Bigrams from Raw Text]
hrcemails <- tidy(cleanset.preStem) 

count_bigrams <- function(dataset) {
  dataset %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    dplyr::count(word1, word2, sort = TRUE)
}

hrcemails.bigrams <- hrcemails %>%
  count_bigrams() %>%
  filter(!(word1 %in% corpus.stopwords), !(word2 %in% corpus.stopwords)) 

visualize_bigrams <- function(bigrams) {
  set.seed(824)
  a <- grid::arrow(type = "closed", length = unit(.08, "inches"))
  
  bigrams %>%
    graph_from_data_frame() %>%
    ggraph(layout = "fr") +
    geom_edge_link(aes(edge_alpha = n), show.legend = FALSE, arrow = a) +
    geom_node_point(color = "Gray", size = 3) +
    geom_node_text(aes(label = name), vjust = 1, hjust = 1, size=4) +
    theme_void()
}

visualize_bigrams(head(hrcemails.bigrams, 100))
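As an aside, longer sequences can be extracted with the same tidytext pattern by changing n; a minimal sketch for trigrams on a toy data frame (the sentence is invented, not drawn from the corpus):

```r
library(dplyr)
library(tidytext)

texts <- data.frame(
    text = "the white house press office issued a statement",
    stringsAsFactors = FALSE)

# Identical to the bigram pipeline above, with n = 3
trigrams <- texts %>%
    unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
    dplyr::count(trigram, sort = TRUE)

trigrams
```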

While this has some promising data, it’s difficult to look at and follow because it has no depth or differentiation. Another problem is that names still tend to clutter the graph. The graph can be augmented by adding dimensions from outside sources to give each N-gram a context. This is not an exact science, but prioritizing the identification of context in each N-gram delivers enough context to make following the graph more reasonable. By using dictionary matching, a type is appended to each N-gram as a form of entity extraction. This dictionary matching creates 5 different types:

  • Organizations (“Org”)
  • Titles (“Title”)
  • Names (“Name”)
  • Countries (“Country”)
  • Common 2-word Bigrams (“Phrase”)

A new column called “type” acts as a cue to enhance the visualization. The code below isn’t meant to be robust, but rather to demonstrate how entity extraction can be used in practice.

[R SOURCE: Augmenting Bigrams with Entity Type Tags]
hrcNgrams <- filter(hrcemails.bigrams,
                            !str_detect(hrcemails.bigrams$word1, "^\\w$"), 
                            !str_detect(hrcemails.bigrams$word2, "^\\w$"))

hrcNgrams$type <- NA

getBigramRowMatchesFromFile = function(x, file) {
    e <- read.csv(file=file, header=T, sep=" ")
    return(which(is.na(x$type) & x$word1 %in% e$word1 & x$word2 %in% e$word2))
}
hrcNgrams[getBigramRowMatchesFromFile(hrcNgrams, "titles.txt"), "type"] <- "Title"
hrcNgrams[getBigramRowMatchesFromFile(hrcNgrams, "orgs.txt"), "type"] <- "Org"

first.names <- 
    tolower(scan(file="First_Names.csv", what="character")) %>%
    unique()
last.names <-
    tolower(scan(file="Last_Names.csv", what="character")) %>%
    unique()
hrcNgrams.names <-
      which(is.na(hrcNgrams$type) &
          ((hrcNgrams$word1 %in% first.names & hrcNgrams$word2 %in% last.names) | 
              (hrcNgrams$word2 %in% first.names & hrcNgrams$word1 %in% last.names)))

hrcNgrams[hrcNgrams.names,"type"] <- "Name"

countries <- read.csv(file="countries_split.csv")
# A bigram is tagged as a country when word1 and word2 match the first
# and second halves of a split country name, respectively
hrcNgrams.countries <- 
        which(is.na(hrcNgrams$type) & 
                hrcNgrams$word1 %in% countries$first & 
                hrcNgrams$word2 %in% countries$second)
hrcNgrams[hrcNgrams.countries,]$type <- "Country"

en2grams <- read.csv(file="en2grams.csv")
hrcNgrams.twograms <- 
        which(is.na(hrcNgrams$type) & 
                (hrcNgrams$word1 %in% en2grams$word1 & 
                hrcNgrams$word2 %in% en2grams$word2))
hrcNgrams[hrcNgrams.twograms,]$type <- "Phrase"

hrcNgrams[is.na(hrcNgrams$type),]$type <- "Other"
word1 word2 n type
united states 7,151 Country
state department 6,233 Org
white house 4,699 Org
department state 3,812 Org
prime minister 3,392 Title
secretary state 3,187 Title
secretary office 3,014 Org
human rights 2,544 Phrase
secretary clinton 2,448 Phrase
foreign policy 2,029 Phrase
middle east 1,988 Phrase

With this categorized data, the visualization has depth. The graph below demonstrates the value of the type in a low-dimensional graph, but combined with other variables, as will be seen later, the effect on overall understanding of the data is remarkable.

[R SOURCE: Bigram Graph with Typed Categories]
visualize_bigrams_2 <- function(bigrams, verts) { 
  set.seed(1587)
  a <- grid::arrow(type = "closed", length = unit(.08, "inches"))
  
  bigrams %>%
    graph_from_data_frame(vertices=verts) %>%
    ggraph(layout = "nicely") +
    geom_edge_link(aes(color = type), show.legend=FALSE, arrow = a) +
    geom_node_point(aes(color = type), size=3, show.legend=TRUE) +
    geom_node_text(aes(label = name, color=type), 
                   vjust = 0, hjust = 1, size=4, nudge_x = -.25, 
                   show.legend = FALSE) +
    scale_color_brewer(palette="Dark2", type="qual", guide="legend",
                       direction=-1) +
    scale_edge_color_brewer(palette="Dark2", type="qual", guide="none",
                       direction=-1) +
    scale_x_continuous(expand = expand_scale(c(0.175, 0.0))) +
    labs(color="Type") +
    theme_safari()
}

# This bit of nonsense is necessary to make the graph vertices work; why it
# can't infer the vertices from the edge as an option is a question for a
# different forum.  
getVertsFromBigrams = function(bigrams) {
    verts <- 
        data.frame(name=c(bigrams$word1, bigrams$word2),
                   type=c(bigrams$type, bigrams$type)) %>%
        filter(!duplicated(name))
    return(verts)
}

hn <- head(hrcNgrams, 125) %>%
        filter(!word1 %in% tolower(directory$Moniker), !word2 %in% tolower(directory$Moniker))
verts <- getVertsFromBigrams(hn)

visualize_bigrams_2(hn, verts)

Term Clustering by Person

The previous graph showed only word-to-word correlation. It’s interesting as an enhanced word cloud, but doesn’t go beyond a general understanding of topics discussed in the entire corpus. By cross-referencing these n-grams against external contextual clues, it’s possible to glimpse person-to-topic correlations in a reasonably abstract way.

[R SOURCE: Compute Raw Word Correlations Across Corpus]
# Logical subsetting avoids the empty-vector pitfall of -which() when no
# first names overlap with last names
mwc.stopwords <- c(corpus.stopwords, first.names[!first.names %in% last.names])

getWordsFromText = function(texts, stopwords, ids=NULL) {
  t <- texts
  if (!is.null(ids)) t <- t %>% filter(id %in% ids)

  words <- t %>%
      select(id, text) %>%
      unnest_tokens(word, text) %>%
      filter(nchar(word) > 2, !(word %in% stopwords))
  
  return(words)
}

getWordCorrelations = function(w, limit = 100, positive.only = TRUE) {
    corr <- w %>%
      group_by(word) %>%
      filter(n() > limit) %>%
      pairwise_cor(word, id, sort=TRUE)
    
    if (positive.only)
      corr <- corr %>%
        filter(correlation > 0.01)
  
    return(corr)
}

mwc.words <- getWordsFromText(hrcemails, mwc.stopwords)
# Necessary because the correlation should be against stems
mwc.unique.words <- unique(mwc.words$word)
mwc.stem.lookup <- data.frame(word=mwc.unique.words, 
                              stem=stemDocument(mwc.unique.words, language="en"), 
                              row.names = mwc.unique.words)

mwc.wft = cs2$getMostFrequentTerms(sparse = FALSE)
mwc.words$word <- mwc.stem.lookup[mwc.words$word,"stem"]

mwc.corr <- getWordCorrelations(mwc.words) 
mwc.corr$freq1 <- mwc.wft[as.character(mwc.corr$item1), "freq"]
mwc.corr$freq2 <- mwc.wft[as.character(mwc.corr$item2), "freq"]
mwc.corr$item1 <- mwc.wft[as.character(mwc.corr$item1), "word"]
mwc.corr$item2 <- mwc.wft[as.character(mwc.corr$item2), "word"]

In the following graph, terms that are highly correlated with a person found in the previously-mentioned directory are plotted. Edges indicate the degree of correlation, which is purely optional, but does convey some additional understanding.

[R SOURCE: Graph Word Correlation with Key Players]
graph.term.by.person = function(corr.words, moniker.list, g.layout="fr") {
    all.terms <- c(as.character(corr.words$item1), as.character(corr.words$item2)) %>% unique()
    persons = all.terms[all.terms %in% tolower(as.character(moniker.list$moniker))]
    persons2 = all.terms[all.terms %in% last.names & !all.terms %in% persons]
    terms <- all.terms[!all.terms %in% persons & !all.terms %in% persons2]
    
    gtbp.pal = brewer.pal(8, "Dark2")
    gtbp.person.text = gtbp.pal[2]
    gtbp.person.node = gtbp.pal[2]
    gtbp.person2.text = gtbp.pal[3]
    gtbp.person2.node = gtbp.pal[3]
    gtbp.term.text = "grey40"
    gtbp.term.node = gtbp.pal[1]
    
    verts <- rbind(
        data.frame(name=persons, type="KeyPerson", face="bold", size=5,
                   displayName=moniker.list[persons,"displayName"],
                   nodeColor=gtbp.person.node, labelColor=gtbp.person.text),
        data.frame(name=persons2, type="Person", face="bold", size=4,
                   displayName=persons2,
                   nodeColor=gtbp.person2.node, labelColor=gtbp.person2.text),
        data.frame(name=terms, type="Term", face="bold", size=4, displayName=terms,
                   nodeColor=gtbp.term.node, labelColor=gtbp.term.text)
    )
    
    set.seed(1234)
    g = graph_from_data_frame(corr.words, vertices = verts) %>%
            ggraph(layout = g.layout) +
            geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation), edge_colour = "DarkGray", show.legend=F) +
            geom_node_point(aes(color=nodeColor), size = 4, alpha=.8) +
            geom_node_text(aes(label = displayName, fontface=face, color=labelColor, size=type), 
                           repel = TRUE, point.padding = unit(0.3, "lines"), 
                           show.legend=TRUE, check_overlap=TRUE) +
            scale_edge_width(range=c(1,3), guide="none") +
            scale_edge_alpha(guide="none", range=c(.25, .65)) +
            scale_size_manual(guide="none", values=c(6,3,3)) +
            scale_color_identity() + 
            labs(color="Type") +
            theme_safari()
          
    return(g)
}

mwc <- mwc.corr %>%
  filter(correlation > 0.20, item1 %in% tolower(directory$Moniker), !item2 %in% tolower(directory$Moniker))  %>%
  group_by(item1) %>%
  top_n(18, correlation) %>%
  ungroup() 

graph.term.by.person(mwc, mlist, g.layout="fr")

With just a little more data in the graph, including more edges to represent high-correlation connections and 2nd-degree correlations (or people-to-N-gram connections), some patterns begin to emerge. This really shows the power of visualization: without any modern AI, using only count-based methodologies, a pattern becomes evident. In particular, in this graph, higher-ranking officials (Clinton, Mills, Sullivan) have tighter clusters of N-grams, whereas the staff members tend to have hub-and-spoke patterns with weaker average correlations. Also, the terms for Mills and Sullivan are more saturated with names than those of the others in the graph, which probably relates to the high volume of headlines and press articles with their names attached, whereas others are communicating with colleagues or setting up meetings. The actionable implications could be an interesting area of study, especially if a repeatable scoring mechanism could be achieved.

[R SOURCE: Plot Bigram Correlation with Key Players]
# Add additional edges to the graph
mwc2 <- mwc.corr %>% 
  filter(correlation > 0.6, item1 %in% mwc$item2, item2 %in% mwc$item2, !item1 %in% tolower(directory$Moniker), !item2 %in% tolower(directory$Moniker)) %>%
  group_by(item1) %>%
  top_n(3, correlation) %>%
  ungroup() 

# Add one additional, 2nd-degree term to the graph
mwc2.terms <- unique(c(as.character(mwc2$item1), as.character(mwc2$item2)))
mwc3 <- mwc.corr %>%
  filter(item1 %in% mwc2.terms, !item2 %in% mwc2.terms, !item1 %in% mwc$item1, !item2 %in% mwc$item1, nchar(as.character(item1))>3, nchar(as.character(item2))>3) %>%
  group_by(item1) %>%
  top_n(3, correlation) %>%
  ungroup()

graph.term.by.person(rbind(mwc, mwc2, mwc3), mlist, g.layout = "fr")
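A repeatable scoring mechanism of the kind suggested above could begin with something as simple as the mean correlation per person. A minimal sketch using dplyr on a toy table shaped like the mwc correlation data (item1 = person, item2 = correlated term; the names and values are invented):

```r
library(dplyr)

# Toy correlation table in the shape produced by pairwise_cor()
toy.corr <- data.frame(
    item1 = c("clinton", "clinton", "mills", "mills", "mills"),
    item2 = c("benghazi", "testimony", "haiti", "schedule", "staff"),
    correlation = c(0.62, 0.55, 0.31, 0.24, 0.22),
    stringsAsFactors = FALSE)

# Mean correlation per person as a crude "cluster tightness" score:
# tighter topical clusters should yield higher averages
person.scores <- toy.corr %>%
    group_by(item1) %>%
    summarise(n.terms = n(), mean.corr = mean(correlation)) %>%
    arrange(desc(mean.corr))

person.scores
```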

The data can also be displayed as a simple report, as seen below:

[R SOURCE: Correlated Bigrams by Key Person Report]
mwc.rpt.data <- by(mwc, mwc$item1, FUN = ( 
        function(m) {
            pt <- as.character(m$item2)
            t <- mwc2[seq(1, nrow(mwc2),2),1:2] %>%
                  filter(item1 %in% pt | item2 %in% pt) 

            return(
                data.frame(
                    Person = gsub(pattern=" ", replacement="&nbsp;",  mlist[as.character(m$item1[1]),"displayName"]),
                    Terms = paste(paste(t$item1, t$item2, sep="-"), collapse=", ")
            ))
        }
    )) %>%
    rbind.fill() %>%
    filter(nchar(as.character(Terms))>0)
Bigrams Correlated to Key Players
Person Terms
Cheryl Mills kotecki-mccary, shamim-kazemi, shamim-bastien, bastien-kazemi, mccary-pierre, kara-mcdonald, mccary-kali, kotecki-pierre, kotecki-kali, shamim-kotecki, toiv-nora, bastien-kotecki, kali-kujawinski, louis-mccary, louis-kotecki, kujawinski-kotecki, kenneth-merten, mcdonald-kujawinski, mcdonald-kotecki, lindwall-kujawinski, kenneth-lindwall, kenneth-kujawinski, adams-mcdonald, adams-bastien
Hillary Clinton rand-relieved, rand-sen, benghazi-attack, sen-relieved, rand-attack, terrorist-attack, hear-rand, tragedy-rand, tragedy-attack, misleading-sen, testimony-hear, tragedy-relieved, answer-hear, johnson-rand, happened-hear, happened-rand, dead-attack, answer-happened, dead-rand, johnson-sen, clinton-happened, clinton-rand, difference-happened, emotional-sen, difference-misleading
Huma Abedin scanlon-turkeyarmenia, turkeyarmenia-davutoglu, davutoglu-scanlon
Jacob Sullivan alon-sachar, prem-kumar, sachar-neaipa, alon-neaipa, schlicher-ronald, rubinstein-neaipa, sachar-rubinstein, rubinstein-alon, mara-rubinstein, hale-neaipa, hale-alon, sutphin-sachar, wailes-neaipa, prem-shapiro, mara-alon, shapiro-kumar, jacob-wailes, prem-neaipa, kumar-neaipa, daniel-neaipa, feltman-wailes, dennis-shapiro, jacob-rubinstein
Monica Hanley coleman-claire, pouch-pdb

Event Association and Visualization

Email can assist in telling a story. Treated as individual emails, there may be much interesting data, but aggregating a sufficient volume of email into a continuous document can often be revealing. In the next set of visualizations, terms are correlated against key events, giving more meaningful results. In the following series of visualizations, events are used as filters to determine event-specific relationships between terms. Combined with the term types, this produces some powerful, multi-dimensional visualizations.

Extracting Metadata

Date Sent

Barring OCR errors, the sent date can be inferred with a regular expression search of the corpus, and used as auxiliary metadata.

[R SOURCE: Extract Email Sent Time From Original Corpus]
getFirstSentTime = function(content) { 
    sent = content[grep(content, pattern="Sent:.*2\\d{3}\\s\\d{1,2}:\\d{2}\\s(AM|PM)")][1]
    if (is.na(sent)) return(NA)
    
    dtstr = gsub(sent, pattern="Sent:\\s*", replacement="", perl=T, fixed=F)
    dt = as.numeric(strptime(dtstr, "%A, %B %e, %Y %I:%M %p", tz="EST"))
    return(dt)
}

metadata <- data.frame("id"=character(), "sent.time"=numeric())
for (i in 1:length(corpus)) {
    metadata <- rbind(metadata, 
                        data.frame("id"=corpus[[i]]$meta$id, "sent.time" = getFirstSentTime(corpus[[i]]$content)))
}

cat(sprintf("Missing Time Data = %.0f%%", (length(which(is.na(metadata$sent.time)))*100)/nrow(metadata)))
## Missing Time Data = 8%

A list of events, culled from Secretary Clinton’s time in office can be used to correlate activity and even boost relevance signals relative to each event. This list is adapted from here.

keyEvents <- read.csv("ClintonEvents.csv", header = T, row.names=NULL) 
keyEvents$Date <- as.Date(keyEvents$Date, "%Y-%m-%d", origin="1970-01-01", tz="EST")
keyEvents$WindowStart <- keyEvents$Date - 7
keyEvents$WindowEnd <- keyEvents$Date + 23

Topics Following Events

The following graph gives an overview of the selection of data that will be analyzed in later graphs. As a timeline, it shows the volume of email by month, along with the volume and range of data that will be evaluated for each event. The question will be: Can isolating this data reveal unique signals within the text for that particular event? An arbitrary window of 7 days before and 23 days after each key event (a 30-day window) was selected for analysis, on the theory that there is planning leading up to an event, followed by review and press analysis afterward. This is necessarily a use-case-specific assumption.

First, the event number is stamped on every document with an extracted sent date, and the number of documents for that event is counted.

[R SOURCE: Get Data by Event Window (-7/+23 days)]
# This code works as long as no two events are less than 30 days apart.
md <- metadata %>% filter(!is.na(sent.time)) 
md <- data.frame(sent.time=anydate(md$sent.time), id=md$id)
md$event <- sapply(md$sent.time, (
                        function(x) {
                            # match the email date against each event's -7/+23 day window
                            y = which(keyEvents$WindowStart <= anydate(x) &
                                    anydate(x) <= keyEvents$WindowEnd)
                            return(y[1])
                        }
                    ))

md.eventData <- md[!is.na(md$event),]
keyEvents$nDocs <- table(md.eventData$event)

From this metadata, a timeline of email volume with overlaid events is plotted. Also overlaid is an indicator of a sharp drop in available data following the attack on the US Mission in Benghazi, Libya, most likely due to higher-than-normal redaction or classification of content during that time period.

[R SOURCE: Timeline of Content Volume, Including Key Events]
md.freq <- as.data.frame(table(anydate(md$sent.time), dnn=c("sent.date"))) %>%
    group_by(month=floor_date(anydate(sent.date), "1 month")) 
md.freq <- aggregate(md.freq$Freq, by=list(month=md.freq$month), FUN=sum) 
colnames(md.freq) <- c("month", "freq")
 
benghazi <- md.freq[md.freq$month >= anydate("2012-09-01") & md.freq$month <= anydate("2013-01-01"),]

ggplot(md.freq, aes(x=month, y=freq)) +
        geom_line(stat = "identity", color="DarkBlue", alpha=.50) + 
        geom_area(stat = "identity", fill="DarkBlue", alpha = 0.10) +
        geom_rect(aes(xmin=WindowStart, xmax=WindowEnd, ymin=0, ymax=nDocs),
                  data=keyEvents, inherit.aes = FALSE,
                  fill="DarkGreen", color="DarkGreen", alpha=.25) +
        geom_text(data=keyEvents, mapping=aes(x=Date, y=nDocs, 
                                  label=str_wrap(EventName, 12)),
                  size=3.5, vjust=0, hjust="center", lineheight=.8, nudge_y = 5, nudge_x = 7,
                  color="DarkGreen",  fontface="bold") +
        scale_x_date(breaks = keyEvents$Date,
                     labels = date_format('%b %d, %Y')) +
  # Draw a circle around data related to Benghazi Attack
        geom_encircle(data=benghazi, show.legend=FALSE, 
                      fill="Red", alpha=.20, expand=0) +
        ylab("Email Count") + 
        scale_y_continuous(limits=c(0, max(md.freq$freq))) +
        labs(title = "Volume of Email by Month and Key Foreign Policy Events") +
        theme_fivethirtyeight() + 
        theme(plot.title = element_text(hjust=.5), 
              axis.text.x = element_text(hjust=1, angle=45, face="bold", size=12, color="DarkGreen"),
              panel.background = element_rect(fill = "white"),
              plot.background = element_rect(fill = "white")
              )

One thing the graph illustrates is that this event-wise analysis covers only a fraction of the overall metadata: the event-related timeframes span 28% of the 1,416 days in the dataset, and the selected content likewise represents 28% of the overall data (“law of averages”).
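The 28% day-coverage figure can be sanity-checked directly from the event windows. Below is a minimal sketch of that arithmetic; the function name, argument names, and defaults are illustrative (not part of the code above), and it assumes the -7/+23 day windows do not overlap:

```r
# Sketch: fraction of the dataset's day-span that falls inside some event window.
# Assumes the per-event windows do not overlap; names here are illustrative.
windowCoverage <- function(eventDates, firstDay, lastDay, before = 7, after = 23) {
    # clamp each window to the dataset's date range, then count covered days
    covered <- sum(pmin(eventDates + after, lastDay) -
                   pmax(eventDates - before, firstDay) + 1)
    total   <- as.numeric(lastDay - firstDay) + 1
    as.numeric(covered) / total
}
```

Applied to the 13 key events over the dataset's 1,416-day span, this reproduces the roughly 28% coverage noted above.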

Selecting Relevant Topic Data for Each Event

For each event, what are the top words being discussed? First, a list of all corresponding words is compiled by event, along with the combined relevancy of each pair to the overall dataset, using the TfIdf-weighted DTM results. All this data is then collated to produce some illuminating visualizations.

[R SOURCE: Calculate Word Relevancy by Key Event w/Correlations]
cbe.stopwords = c( 
    mwc.stopwords,
    unique(c(tolower(directory$Moniker), stemDocument(tolower(directory$Moniker), language="en")))
)  

cbe <- data.frame(
          item1       = character(), 
          item2       = character(), 
          correlation = numeric(), 
          event       = numeric())

for (e in 1:nrow(keyEvents)) {
    group.ids <- md.eventData[md.eventData$event == e,"id"]
    group.words <- getWordsFromText(hrcemails, cbe.stopwords, group.ids) %>%
        filter(nchar(word) >= 3, !grepl(word, pattern="stategov[a-z]?$", perl=TRUE, ignore.case=TRUE))
  
    group.corr <- getWordCorrelations(group.words, limit=50) 
    group.corr$event <- e

    cbe <- cbe %>% rbind(group.corr) 
}

# remove redundant data
cbe <- cbe[seq(2, nrow(cbe), 2),] 

cbe$eventDate <- keyEvents$Date[cbe$event]
cbe$wordpair <- paste(cbe$item1, cbe$item2, sep = "::")
# since we stemmed our matrix, we want to get the weight of the stemmed items, for better overall normalization
cbe$stem1 <- stemDocument(cbe$item1, language="en")
cbe$stem2 <- stemDocument(cbe$item2, language="en")

#safety check to make sure we're working only with stems that are actually in the DTM
dtm.full <- cs2$getDTM(sparse=F)
cbe <- cbe %>% filter(stem1 %in% dtm.full$dimnames$Terms, stem2 %in% dtm.full$dimnames$Terms)

getRelevancyByEvent = function(e.data) {
    e <- e.data$event[1]
    event.terms <- unique(c(e.data$stem1, e.data$stem2))
    cat(sprintf("Compiling relevancy for event %d (%s subject terms): %s \n", e, comma_format()(length(event.terms)), keyEvents$EventName[e]))
    event.ids <- unique(as.character(md.eventData[md.eventData$event == e, "id"]))
    term.sums <- col_sums(dtm.full[event.ids, event.terms])
    e.data$rel1 <- term.sums[e.data$stem1]
    e.data$rel2 <- term.sums[e.data$stem2]
    
    cat(sprintf("     %s relevancy scores retrieved\n", comma_format()(nrow(e.data))))
    
    return(e.data)
}

cbe.gd <- cbe %>%
    by(cbe$event, getRelevancyByEvent) %>%
    rbind.fill() %>%
    group_by(event) %>%
    top_n(30, (rel1 + rel2)) %>%
    ungroup()
## Compiling relevancy for event 1 (61 subject terms): Obama Gives Speech in Cairo (Egypt) 
##      1,098 relevancy scores retrieved
## Compiling relevancy for event 2 (131 subject terms): Obama Awarded Nobel Peace Prize 
##      4,689 relevancy scores retrieved
## Compiling relevancy for event 3 (281 subject terms): Haiti Aid 
##      22,658 relevancy scores retrieved
## Compiling relevancy for event 4 (110 subject terms): START Treaty Signed to Reduce Nuclear Arms 
##      3,258 relevancy scores retrieved
## Compiling relevancy for event 5 (101 subject terms): G8 Summit 
##      4,025 relevancy scores retrieved
## Compiling relevancy for event 6 (245 subject terms): Obama Announces End to Iraq Combat Mission 
##      25,588 relevancy scores retrieved
## Compiling relevancy for event 7 (175 subject terms): Midterm elections 
##      12,230 relevancy scores retrieved
## Compiling relevancy for event 8 (229 subject terms): New START Treaty Signed with Russia 
##      21,412 relevancy scores retrieved
## Compiling relevancy for event 9 (185 subject terms): Osama Bin Laden killed 
##      12,817 relevancy scores retrieved
## Compiling relevancy for event 10 (280 subject terms): Obama announces end of Iraq War 
##      27,453 relevancy scores retrieved
## Compiling relevancy for event 11 (298 subject terms): Obama Freezes Iranian Assets 
##      32,640 relevancy scores retrieved
## Compiling relevancy for event 12 (174 subject terms): Benghazi Diplomatic Mission Attack 
##      11,320 relevancy scores retrieved
## Compiling relevancy for event 13 (78 subject terms): Obama Reelected 
##      2,064 relevancy scores retrieved

Graphing Term Association by Event and Relevancy

To make a graph, the data is prepped with a type column and organized into edges and nodes. Pairs of words with high correlations and high relevancy within a specific event are arranged with event association on the X axis and relevancy on the Y axis. Since words appear more than once in the data set, terms that appear in more than one event are merged and their positions averaged.

In the following graph, terms are color-coded as before, with an additional indication of event association: uppercase marks terms associated with only one event, while lowercase terms appear in more than one. Since multi-event terms are shown at their average event location, they tend to cluster toward the center of the visualization. Proportionally, single-event terms tend to be more interesting than multi-event terms, though the latter may help qualify their paired terms, as represented by the connecting arcs.

[R SOURCE: Plot Relevant Key Terms by Event with Word Correlations]
# rack and stack 'em 
getWordType = function(word) {
    if (word[1] %in% last.names) {
        return("Name")
    } else if (word[1] %in% countries$first | word[1] %in% countries$second) {
        return("Country")
    } else {
        return("Other")
    }
}
 
cbe.nodes <- 
    data.frame( 
        word        = c(as.character(cbe.gd$item1), as.character(cbe.gd$item2)),
        event       = c(cbe.gd$event, cbe.gd$event),
        correlation = c(cbe.gd$correlation, cbe.gd$correlation),
        relevancy   = c((cbe.gd$rel1 + cbe.gd$rel2)/2, (cbe.gd$rel1 + cbe.gd$rel2)/2),
        stringsAsFactors = FALSE
    ) %>%
    by(.$word,
            FUN = (
                function(w.data) {
                    w.data$wordType <- getWordType(w.data$word)
                  
                    avg.relevancy <- mean(w.data$relevancy)
                    w.data$relevancy <- avg.relevancy
                    
                    avg.correlation <- mean(w.data$correlation)
                    w.data$correlation <- avg.correlation
                  
                    w.data$minEvent <- min(w.data$event)
                    w.data$maxEvent <- max(w.data$event)
                    avg.event <- mean(w.data$event)
                    w.data$event <- avg.event
                    w.data$point <- ifelse(w.data$minEvent == avg.event & w.data$maxEvent == avg.event, 2, 1)
                    
                    w.data$x <- avg.event
                    w.data$y <- avg.relevancy
                    
                    return(unique(w.data))
                }
            )) %>%
    rbind.fill()

gg = graph_from_data_frame(cbe.gd, vertices=cbe.nodes, directed=TRUE) %>%
            ggraph(layout="manual", node.positions=cbe.nodes) +
            geom_edge_arc(aes(edge_alpha = correlation), edge_colour="DarkGray", show.legend=c(edge_alpha=FALSE), curvature = .05) +
            geom_node_point(aes(color=wordType, size=point), show.legend = c(color=TRUE, size=FALSE)) +
            geom_text_repel(aes(x=x, y=y, label = ifelse(point==2, toupper(name), tolower(name)), color=wordType), show.legend = FALSE, segment.size=0.25, point.padding = unit(0.2, "lines"), force=1.1) +
            scale_color_manual(name="Type", values=c(Other="DarkSlateGray", Name="Blue", Country="Red", Keyword="DarkOrange")) +
            geom_point(data=keyEvents, mapping=aes(x = 1:nrow(keyEvents), y=2.81), color="DarkOrange4") +
            geom_text(data=keyEvents, mapping=aes(x = 1:nrow(keyEvents), y=2.9, label=str_wrap(EventName, width = 12)), size=3, fontface="bold", color="DarkOrange4", vjust=0) + 
            ggtitle("Relevant Keywords Per Major Foreign Policy Event") + 
            scale_y_continuous(limits=c(2.8,max(cbe.nodes$y)), trans = "log10") +
            scale_size_continuous(range=c(1.5, 2)) +
            scale_edge_alpha(guide="none") +
            theme_safari() 
gg

Structured Word Columns per Event

The previous graph presents many variables at once and helps show the relative relationships in the data, but its complexity may be distracting for some viewers. A more structured version, without the term associations, may be preferable:

[R SOURCE: Plot Relevant Key Terms by Event - Simplified]
cbe.nodes2 <- 
    data.frame( 
        word        = c(as.character(cbe.gd$item1), as.character(cbe.gd$item2)),
        event       = c(cbe.gd$event, cbe.gd$event),
        relevancy   = c((cbe.gd$rel1+cbe.gd$rel2)/2, (cbe.gd$rel1+cbe.gd$rel2)/2),
        stringsAsFactors = FALSE
    ) %>%
    by(.$event, FUN=(function(w.data) {
        e <- w.data$event[1]
        cbe.related <- cbe.gd[cbe.gd$event == e,]
        
        r.data <- 
            data.frame(
                word              = character(), 
                related           = character(), 
                relevancy         = numeric(), 
                related.relevancy = numeric(),
                related.rank      = integer(),
                event             = numeric(), 
                rank              = numeric()
            )
        seen.words = vector()
         
        sorted.words <- w.data[order(-w.data$relevancy),]$word
        
        for (w in sorted.words) {
            if (!(w %in% seen.words)) {
                related.words <- c(cbe.related[cbe.related$item1 == w,]$item2, cbe.related[cbe.related$item2 == w,]$item1)
                related.rel   <- c(cbe.related[cbe.related$item1 == w,]$rel2, cbe.related[cbe.related$item2 == w,]$rel1)
                seen.words <- unique(c(w, seen.words))
                #compose the data
                r.row <- 
                    data.frame(
                        word              = w,
                        wordType          = getWordType(w),
                        related           = I(list(related.words)),
                        relevancy         = max(w.data[w.data$word == w,"relevancy"]),
                        related.relevancy = I(list(related.rel)),
                        related.rank      = I(list(rank(related.rel, ties.method="first"))),
                        event             = e,
                        rank              = NA
                    )
              
                r.data <- rbind(r.data, r.row)
            }
        }

        r.data$rank = rank(r.data$relevancy, ties.method="first")
        
        return(r.data)
    })) %>%
    rbind.fill() %>%
    unnest()

cbe.nodes2$related.type <- sapply(cbe.nodes2$related, getWordType)
cbe.gd3 = data.frame(
    item1 = paste(cbe.gd$item1, cbe.gd$event, sep="."),
    item2 = paste(cbe.gd$item2, cbe.gd$event, sep="."),
    correlation = cbe.gd$correlation
)

cbe.nodes3 <- unique(data.frame(
                  name      = paste(cbe.nodes2$word, cbe.nodes2$event, sep="."),
                  x         = cbe.nodes2$event,
                  y         = cbe.nodes2$rank,
                  relevancy = cbe.nodes2$relevancy
              ))

graph_from_data_frame(cbe.gd3, vertices=cbe.nodes3) %>%
            ggraph(layout="manual", node.positions=cbe.nodes3) +
            geom_node_text(data=cbe.nodes2, aes(label = word, x = event, y = rank, size=rank, color=wordType),  show.legend=c(size=FALSE, color=FALSE)) +
            geom_point(data=cbe.nodes2, aes(x = event, y = rank, color=wordType), alpha=0, show.legend=TRUE) +
            geom_point(data=keyEvents, mapping=aes(x = 1:nrow(keyEvents), y=0), color="DarkOrange4") +
            geom_text(data=keyEvents, mapping=aes(x = 1:nrow(keyEvents), y=-1.6, label=str_wrap(EventName, width = 12)), size=4, color="DarkOrange4", vjust=1, lineheight = .9) +
            geom_text(data=keyEvents, mapping=aes(x = 1:nrow(keyEvents), y=-.5, label=format(Date, "%m/%d/%Y")), size=4, fontface="bold", color="DarkOrange4", vjust=1) +
            ggtitle("High Relevancy Terms by Major Foreign Policy Events") +
            scale_y_continuous(limits=c(-4.6,max(cbe.nodes3$y))) +
            scale_size_continuous(range=c(3.5,4.75), guide="none") +
            scale_color_manual(name="Type", values=c(Other="DarkSlateGray", Name="Blue", Country="Red", Keyword="DarkOrange"), guide="legend") +
            guides(colour = guide_legend("Type", override.aes = list(size = 4, alpha = 1), reverse=TRUE)) +
            theme_safari() + theme(plot.title = element_text(hjust=0.5))

Theme River

A more complex form of visualization that tracks specific topics over time is a “theme river.” For an example with a nice interactivity model, see the paper on “RoseRiver,” a more complicated pursuit, but similar in methodology to what is shown here in its selection of topics over time.5
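A basic, static approximation of a theme river can be sketched with ggplot2’s stacked areas. The sketch below assumes a data frame of per-month frequencies for a handful of topics; the topic names and counts are placeholders, not results from this corpus:

```r
library(ggplot2)

# Placeholder data: monthly counts for three illustrative topics
river <- expand.grid(month = seq(as.Date("2010-01-01"), by = "month", length.out = 12),
                     topic = c("benghazi", "haiti", "start"))
set.seed(42)
river$freq <- rpois(nrow(river), lambda = 20)

# Stacked areas over time approximate the "river"; a true theme river
# would also center the stream vertically and smooth the band boundaries.
ggplot(river, aes(x = month, y = freq, fill = topic)) +
    geom_area(position = "stack", alpha = 0.6, color = "white") +
    labs(title = "Topic Volume Over Time (Theme-River Sketch)",
         x = NULL, y = "Term Frequency")
```

With the event-window metadata built earlier, the per-topic frequencies could be drawn from the event-filtered DTM instead of the placeholder counts shown here.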

Summary

With a little work, basic text-mining can go beyond simple statistical queries to provide more useful visualizations and surface more interesting insights. Starting with raw, unstructured text that was the imperfect result of OCR processing of images, content was processed, cleaned, error-corrected, and normalized. Following this, additional use-case specific actions were taken:

  • Removal of “noisy” stop words specific to the corpus
  • Correlation of terms and bigrams with identified key players
  • Correlation of terms and bigrams with major events

This data was then plotted in a variety of novel visualizations to illustrate these associations in ways simple search results, search filters, and simple bar charts or pie charts could not.

The techniques shown are demonstrative of what can be accomplished with existing basic text mining tools. Additional insights may arise from deeper inspection and application of the latest machine-learning capabilities.

© 2018. Mike Safar. All rights reserved.


  1. Hillary Clinton’s email display name is “H”, such a delicate parsing puzzle that it hardly seems accidental.

  2. Multiple labels could be provided on each one, but for this discussion, less is more.

  3. This source data has been adapted from the Ngrams.info website and cannot be redistributed, but can easily be downloaded from their site.

  4. Each of the input files required some tweaking. For example, the list of first names initially didn’t have “Barack” on it, and some names are actually common words like “will”. This kind of tuning is always ongoing; for example, common names change by generation, and immigration patterns change what names might show up in a population. This kind of grooming is necessary for production systems, but the benefits of selective tuning, for example in an application such as handling a large-scale ediscovery case, are considerable, especially if the tuning can be handled on a per-case basis.

  5. Cui, Weiwei and Liu, Shixia and Wu, Zhuofeng and Wei, Hao. “How Hierarchical Topics Evolve in Large Text Corpora,” IEEE Transactions on Visualization and Computer Graphics, vol. 20, iss. 12, pp. 2281-2290, November 2014.